Key findings

Some states that evaluate teachers based partly on student 
learning use the student growth percentile model, which 
computes a score that is assumed to reflect a teacher’s 
current and future effectiveness. This study in a Nevada 
school district finds that half or more of the variance in 
teacher scores from the model is due to random or otherwise 
unstable sources rather than to reliable information that 
could predict future performance. Even when derived by 
averaging several years of teacher scores, effectiveness 
estimates are unlikely to provide a level of reliability desired 
Summary


States across the nation are developing new systems to evaluate teachers based on class- 
room observations, how much students learn, or some combination of these and other 
factors. Evaluations under these systems can have high stakes for teachers. Poor evalua- 
tions may lead to a frozen salary, mandatory remediation, or dismissal, while exceptional 
evaluations may be rewarded with a salary increase or tenure. 

This study tests an implicit assumption of high-stakes teacher evaluation systems that use 
student learning to measure teacher effectiveness: that the learning of a teacher’s students 
in one year will predict the learning of the teacher’s future students. Evaluation systems 
that identify low-scoring teachers for remediation assume that if the teachers are not 
retrained, their future teaching will also be relatively ineffective. Systems that award tenure 
to teachers who score higher assume that those teachers will remain effective. Examining 
the stability of teacher-level growth scores over time provides evidence of the validity of 
the interpretation and use of such scores for teacher evaluation and offers information that 
could be useful in designing alternative evaluation systems. 

A common method of measuring student learning for teacher evaluation is the student 
growth percentile model, which assigns each student a percentile rank in the distribution 
of assessment scores for students at the same grade level and with a similar achievement 
history. The median student growth percentile of a teacher’s students is the teacher-level 
growth score, which tends to be used for teacher evaluation. 

This study, requested by the Nevada Department of Education, investigates the stability of 
the teacher-level growth score. Three years of math and reading score data were analyzed 
for close to 370 elementary and middle school teachers from Washoe County School Dis- 
trict, Nevada’s second largest school district. 

In math, half the variance in teacher scores in any given year was attributable to differ- 
ences among teachers, and half was random or unstable. In reading, the proportion of 
the variance attributable to differences among teachers was .41, and .59 was random or 
unstable. 

More stable measures of effectiveness can be constructed by averaging multiple years of 
growth scores for a teacher. For example, when effectiveness is computed as an average of 
annual scores for three years, the proportion of the variance in teacher scores attributable 
to differences among teachers is .75 in math and .68 in reading. 

These estimates do not meet the .85 level of reliability traditionally desired in scores used 
for high-stakes decisions about individuals (Haertel, 2013; Wasserman & Bracken, 2003). 
States that are considering the student growth percentile model for teacher accountability 
may want to be cautious about using the scores for high-stakes decisions. 
Why this study?


As of early 2014, 40 states and the District of Columbia were using or piloting methods 
to evaluate teachers in part according to the amount students learn (Collins &
Amrein-Beardsley, 2014). Such evaluations can have high stakes for teachers. In some states the 
consequences of a poor evaluation can include a frozen salary, remediation, or dismissal, 
while consequences for exceptional evaluations can include a bonus, a salary increase, or 
tenure (Herlihy et al., 2014). 


A common method of measuring student learning for teacher evaluation is the student 
growth percentile model developed by Betebenner (2011), which is sometimes referred to 
as the Colorado growth model. It is in various stages of use — from preliminary investiga- 
tion to full-scale adoption — in as many as 20 states (New Jersey Department of Education, 
2012; see also Collins & Amrein-Beardsley, 2014). This study, requested by the Nevada 
Department of Education, which at the time of writing was planning to use the student 
growth percentile model, investigated the stability over time of teacher-level growth 
scores, which are derived from student growth scores under the student growth percentile 
model. While other research has examined the stability of both school-level scores derived 
from the student growth percentile model and teacher effectiveness measures derived 
from value-added models (see appendix A), the study team could not locate any published 
studies on the stability of teacher-level growth scores derived from the student growth per- 
centile model. 




The stability of teacher-level growth scores is important to evaluation systems that use the 
scores to measure teacher effectiveness. Underlying such systems is the implicit assumption 
that a teacher’s growth score in one year predicts that teacher’s effectiveness in future years 
(Glazerman et al., 2011). Evaluation systems that identify low-scoring teachers for reme- 
diation assume that if the teachers are not retrained, their future scores will remain low. 
Similarly, systems that award tenure to teachers who score higher assume that those teach- 
ers will continue to be effective. Examining the stability of teacher-level growth scores 
provides evidence of the extent to which this assumption is warranted and offers informa- 
tion that could be useful in selecting, weighting, and combining measures in evaluation 
systems.

In Nevada, teacher-level growth scores are included in accountability models. Initially used 
only for schools, they are now used for teachers as well. In 2009 the Nevada Legislature 
mandated a statewide growth model for school accountability. The Nevada Department 
of Education established selection criteria for the model, including that it be a valid, reli- 
able, and technically sound metric conditioned on students’ past performance when per- 
formance is measured by scores that are not on the same scale from one grade to the 
next (Davidson, Ozdemir, & Harris, 2010). The department ultimately selected the student 
growth percentile model. In this model, growth is not measured as the change in a stu- 
dent’s test scores from one year to the next but as the percentile rank of the student’s score 
in the distribution of the current year’s achievement scores for all students in the state who 
are at the same grade level and who have similar past performance. A grade 6 student’s 
growth score of 40 for math in 2010 would indicate that the student had a 2010 math 
achievement test score equal to or higher than those of 40 percent of the state’s grade 6 
students who had a math achievement history similar to the student’s. A student growth 
score of 50 (the median of the distribution) would indicate typical growth, with higher or 
lower scores indicating greater or less than typical growth (Nevada Department of
Education, 2010). 


In 2011 the Nevada State Legislature expanded the use of growth-model data to educator 
evaluation. Student growth scores are used to produce a teacher-level growth score, which 
is the median of the growth scores for the teacher’s students. A score below 50 indicates 
that the teacher’s typical student scored lower than would be expected for students who 
started the year at a similar achievement level. 
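The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the district's actual procedure: the function name `teacher_growth_score` and the example class of student growth percentiles are hypothetical, and the district's inclusion rules (appendix B) are not applied here.

```python
from statistics import median


def teacher_growth_score(student_growth_percentiles):
    """Return the median student growth percentile for one teacher's students."""
    if not student_growth_percentiles:
        raise ValueError("at least one student growth percentile is required")
    return median(student_growth_percentiles)


# Hypothetical student growth percentiles for one class (illustrative only).
example_class = [22, 35, 41, 48, 55, 63, 77]
score = teacher_growth_score(example_class)

# A score below 50 means the teacher's typical student grew less than
# students statewide who started at a similar achievement level.
below_typical = score < 50
```

Because the median is used rather than the mean, a few students with very high or very low growth percentiles have little influence on the teacher-level score.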


Like other states, Nevada is planning to include in its teacher evaluation system multiple 
measures of teacher effectiveness and multiple years of student outcome data. The analysis 
of the stability of teacher-level growth scores in this study can inform decisions in Nevada 
and elsewhere about how to incorporate teacher-level growth scores as a measure of student 
learning.


What the study examined 


This study examines one overarching research question: How stable over years are annual 
teacher-level growth scores, derived by applying the student growth percentile model to 
student scores from Nevada’s Criterion-Referenced Tests in math and reading? In other 
words, how likely is it that the same score would be obtained in different years? 

The data, methods, and summary variables used in the report are discussed in box 1. 
Details on how the student achievement variables were constructed are in appendix B, 
and details on the design and methods are in appendix C. 




Box 1. Data, methods, and summary variables 

Data 

Because no statewide datasets were available that linked students to teachers so that 
teacher-level scores could be computed, the study uses data on all students in grades 4-8 
from Washoe County School District — Nevada’s second largest school district, with a student 
enrollment of more than 60,000. The district provided student-level scores linked to teach- 
ers for three school years beginning in 2009/10. Data included the student’s grade, school, 
school level (elementary or middle), teacher (math or English language arts), class (because 
some teachers teach more than one class and some classes are taught by more than one 
teacher), current year’s score on Nevada’s Criterion-Referenced Test, proficiency level associ- 
ated with that score, growth score (that is, student growth percentile) for the current year, and 
an indicator variable that identifies whether the student was enrolled in the school for the full 
school year. A teacher-level dataset for each school year was then prepared that contained 
the teacher ID, school ID, and four variables derived from the student achievement data: the 
percentage of the teacher’s students who were proficient in reading, the percentage who were 
proficient in math, the teacher-level growth score in reading, and the teacher-level growth score 
in math (see appendix B for more information on the variables created for the study). 



Methods 

The analysis is based on a generalizability study, which partitions the variance in teacher 
scores into components and estimates the magnitude of each (Brennan, 2001; Shavelson & 
Webb, 1991; see appendix C). Here the three relevant components of variance are true differ- 
ences among teachers, which may provide useful information for decisionmakers; systematic 
year-to-year fluctuations, which affect all teachers’ scores; and random and other sources of 
instability, which cause teachers’ scores to change in unsystematic ways from year to year. 
The last two components could lead to errors in evaluating teachers. 
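The partitioning described above can be illustrated with a small Python sketch. It assumes a fully crossed teacher-by-year design with exactly one score per teacher per year and uses the standard expected-mean-squares estimators; the study's actual estimation procedure is described in appendix C and may differ. The function name `variance_components` and the toy data are hypothetical.

```python
def variance_components(scores):
    """Estimate teacher, year, and residual variance components.

    scores: list of rows, one row per teacher, each row holding that
    teacher's annual score for the same years (no missing cells).
    """
    n_t = len(scores)       # number of teachers
    n_y = len(scores[0])    # number of years
    grand = sum(sum(row) for row in scores) / (n_t * n_y)
    teacher_means = [sum(row) / n_y for row in scores]
    year_means = [sum(row[j] for row in scores) / n_t for j in range(n_y)]

    # Sums of squares for the two-way layout without replication.
    ss_teacher = n_y * sum((m - grand) ** 2 for m in teacher_means)
    ss_year = n_t * sum((m - grand) ** 2 for m in year_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_teacher - ss_year

    ms_teacher = ss_teacher / (n_t - 1)
    ms_year = ss_year / (n_y - 1)
    ms_resid = ss_resid / ((n_t - 1) * (n_y - 1))

    # Solve the expected-mean-square equations; negative estimates are
    # truncated at zero (an assumption of this sketch).
    return {
        "teacher": max((ms_teacher - ms_resid) / n_y, 0.0),
        "year": max((ms_year - ms_resid) / n_t, 0.0),
        "residual": ms_resid,
    }


# Toy data: three teachers by three years, purely additive effects,
# so the residual component should be zero.
toy = [[40, 42, 44], [50, 52, 54], [60, 62, 64]]
components = variance_components(toy)
```

In this toy example all score differences come from stable teacher differences and a shared year effect, so nothing is left over as residual; real data, such as the data analyzed in this report, leave a large residual component.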

Summary variables 

The stability of the teacher scores over time is summarized using the reliability coefficient, 
which is the proportion of the total variance in scores that is attributed to the first component 
of variance, the true differences among teachers (see note 1). The coefficient is reported on a scale of 0 
to 1, where 0 means that none of the variance is due to true differences between teachers and 
1 means that all the variance is due to true differences. The higher the reliability coefficient, 
the more stable the scores. For high-stakes decisions about individuals, some researchers 
argue for a reliability coefficient of .85 or higher (Haertel, 2013; Wasserman & Bracken, 2003). 
By comparison, scores for the licensing examination required for nurses are estimated to have 
reliability coefficients of .87-.92 (National Council of State Boards of Nursing, n.d.), and scores 
for college admissions tests have reliability coefficients ranging from .89 to .93 (College Board, 
2013). 

The magnitude of the total error in scores — errors that are associated with the second and 
third components of variance — is summarized using the standard error of measurement. It can 
be used to establish a score range that is likely to represent a teacher’s effectiveness, much 
like a margin of error is established for results from opinion polls. One commonly used margin 
of error is the 95 percent confidence interval — that is, a range of scores within which there 
is a 95 percent chance that the true score lies (Salkind, 2008). The upper end of this range 
is obtained by adding 1.96 times the standard error of measurement value to the observed 
score, while the lower end is attained by subtracting 1.96 times the standard error of mea- 
surement from the observed score. Because the standard error of measurement is reported in 
the units of the score scale that is being evaluated, it cannot be compared across measures 
with different scales, which means that, unlike for reliability coefficients, there are no general 
guidelines for a target standard error of measurement or 95 percent confidence interval. 

The study also reviews the stability of the status score — the proportion of a teacher’s stu- 
dents who meet grade-level standards — which was used in Nevada’s accountability systems 
before growth scores were introduced. The status score is more familiar to many educators 
and is included in some states’ educator evaluation models, so comparing its stability to that 
of the teacher-level growth score is of some interest. 

Note 

1. Depending on the methods used to estimate this proportion, the coefficient might be referred to as a
reliability coefficient, a generalizability coefficient, or a stability coefficient. These distinctions are not important 
to this report, as the interpretation would be the same no matter how the coefficient is labeled. To simplify 
the presentation, this report uses the general term reliability coefficient. The calculations are for an absolute 
comparison and not a relative comparison, as explained in appendix C. 
What the study found


Nevada’s annual teacher-level growth scores, derived by applying the student growth
percentile model to student scores from Nevada’s Criterion-Referenced Tests in math and 
reading, did not meet a level of stability that would traditionally be desired in scores used 
for high-stakes decisions about individuals. 

This section presents the basic results from the analysis of the stability of annual teacher- 
level growth scores. The statistics underlying the findings are then used to extrapolate the 
likely results if multiple years of data were averaged to derive a teacher score. Additional 
extrapolations project the likely misclassification rates for a particular cutscore; the mis- 
classification rates provide a practical example of the consequences of using teacher-level 
growth scores with low stability. 

Half or more of the variance in teacher-level growth scores was due to random or otherwise unstable 
fluctuations 

No more than half the variance in annual teacher-level growth scores was due to true 
differences among teachers. The proportion of the variance in any given year that was 
attributable to true differences among teachers was .50 in math and .41 in reading; the rest 
was due to random or otherwise unstable fluctuations (table 1). 

The reliability coefficient for status scores was .64 for math and .65 for reading (see table 
1). Thus, the proportion of the variance that was random or otherwise unstable was .36 for 
math and .35 for reading. 

The range of scores likely to include a teacher’s true score would span close to half the 100 point 
scale 

For the annual teacher-level growth scores, the standard error of measurement was 12.22 
for math and 11.31 for reading (see table 1). This means that the 95 percent confidence 
interval for a teacher’s true score would span 48 points for math, a margin of error that 
covers nearly half the 100 point score scale, and 44 points for reading. For example, one 
would be 95 percent confident that the true math score of a teacher who received a score 
of 50 falls between 26 and 74. More precision would be obtained with measures that are 
more stable. 
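The interval arithmetic above can be checked with a short Python sketch. The standard errors of measurement are taken from table 1; the function name `confidence_interval` is hypothetical, and the calculation assumes the usual normal-theory multiplier of 1.96.

```python
def confidence_interval(observed, sem, z=1.96):
    """Return (lower, upper) bounds of the interval observed +/- z * sem."""
    margin = z * sem
    return observed - margin, observed + margin


# Annual teacher-level growth score in math (SEM = 12.22, from table 1).
math_low, math_high = confidence_interval(50, 12.22)
math_width = math_high - math_low

# Annual teacher-level growth score in reading (SEM = 11.31, from table 1).
read_low, read_high = confidence_interval(50, 11.31)
read_width = read_high - read_low
```

Rounded to whole points, the math interval for an observed score of 50 runs from 26 to 74 (a width of 48), and the reading interval is 44 points wide, matching the figures reported above.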

Even when derived by averaging three years of teacher scores, effectiveness estimates based on 
student growth are unlikely to provide a level of stability desired for use in high-stakes decisions 

The stability of a score increases when the score is derived from averages taken over two 
or three years of data (see table 1). This is because the positive and negative errors in 
annual growth scores are averaged, reducing their effect on the teacher-level growth score. 
For example, if the teacher-level growth score were computed using an average of three 
years of data (the maximum number of years before tenure is determined in Nevada), the 
coefficients would be .75 for math and .68 for reading — higher than the coefficients that 
were found for a single annual score (figure 1). 


Table 1. Variance components, coefficients, and standard errors of measurement for teacher-level 
growth scores and teacher-level status scores derived from student achievement scores in math and 
reading 

                                      Teacher-level growth score       Teacher-level status score
                                      (median student growth           (percentage of students who
                                      percentile)                      meet grade-level standards)
Estimates derived from                Math          Reading            Math          Reading
generalizability study                (n = 369)     (n = 375)          (n = 369)     (n = 375)

Variance component
  Teacher                             151.71         90.51             210.40        260.58
  Year                                  8.87          3.57               8.16          1.50
  Residual                            140.50        124.41             108.27        141.10
  Total                               301.08        218.49             326.83        403.18

Coefficient (a)
  Score from one year                    .50           .41                .64           .65
  Average of scores from two years       .67           .58                .78           .78
  Average of scores from three years     .75           .68                .84           .85

Standard error of measurement (b)
  Score from one year                  12.22         11.31              10.79         11.94
  Average of scores from two years      8.64          8.00               7.63          8.44
  Average of scores from three years    7.06          6.53               6.22          6.89

a. Calculated as teacher component/[(teacher component) + (year component/k) + (residual component/k)], where k is the number of 
annual teacher-level scores that are averaged to produce the final teacher score. 

b. Calculated as the square root of [(year component/k) + (residual component/k)], where k is the number of annual teacher-level 
scores that are averaged to produce the final teacher score. 

Source: Authors' analysis of 2009/10, 2010/11, and 2011/12 data provided by Washoe County School District. 


Figure 1. The stability of teacher-level growth scores increases when more annual 
scores are averaged 

[Bar chart. Vertical axis: coefficient, from 0.00 to 1.00. Horizontal axis: number of annual
scores averaged to estimate growth score (1, 2, or 3). Series: math and reading.]

Source: Authors’ analysis of growth scores (from 2009/10, 2010/11, and 2011/12 data) provided by Washoe 
County School District. 
About one in five “typical” math teachers would be expected to be misclassified as low performing 
when ineffective teaching is defined as an annual growth score less than 40 


A teacher with a growth score of 50 is “typical” in the sense that the teacher’s median 
student had a growth score at the median of students in the state with a similar achieve- 
ment history. The teacher’s growth scores as measured on any one occasion may differ 
from 50 due to measurement error. Considering the possible scores a teacher could receive 
on a large number of different occasions, the proportion of the teacher’s scores that are 
below a specified cutscore for effectiveness, say 40, is the proportion of times the typical 
teacher would be misclassified as ineffective (figure 2). 


It is convenient and common to assume a normal distribution of these scores that might 
be observed for a given teacher (Harvill, 1991). In a normal distribution with a mean of 50 
(the teacher’s true score) and a standard deviation of 12.22 (the standard error of measure- 
ment for the annual teacher-level growth score in math, as noted above), 21 percent of the 
scores in the distribution fall below 40; thus, the typical teacher would be expected to be 
misclassified as ineffective 21 percent of the time, when classifications are based on a single 
year of student growth. 
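The 21 percent figure follows directly from the normal-distribution assumption and can be verified with a short Python sketch. The function name `prob_below` is hypothetical; the mean, standard error of measurement, and cutscore are those given in the text.

```python
from math import erf, sqrt


def prob_below(cutscore, true_score, sem):
    """Probability that an observed score falls below the cutscore,
    assuming observed scores are normal with mean true_score and sd sem."""
    z = (cutscore - true_score) / sem
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Typical math teacher: true score 50, SEM 12.22 (table 1), cutscore 40.
p_misclassified = prob_below(40, 50, 12.22)
```

The same function gives the chance that a teacher whose true score is below the cutscore is observed above it, by taking one minus the returned probability.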

The same potential for misclassification applies when talking about a teacher whose true 
score falls below the cutscore for effectiveness but who, due to measurement error, is
identified as effective. For more on how the results of a reliability study can be used to estimate 
the number of teachers who are likely to be misclassified in decisions about their
effectiveness, see appendix D. 




Figure 2. Hypothetical distribution of possible growth scores for a typical teacher 
with true growth score at 50 



Source: Authors’ construction. 
Implications of the study


This is perhaps the first published study of the stability of the teacher-level growth score 
derived under the student growth percentile model, a common model used by states in 
teacher evaluation systems. States or districts may have conducted studies to explore the 
model; if so, those studies have not appeared in the published literature or been posted on 
the Internet (see appendix A for related research). The findings indicate that even when 
computed as an average of annual teacher-level growth scores over three years, estimates 
of teacher effectiveness do not meet the level of stability that some argue is needed for 
high-stakes decisions about individuals, which is a coefficient of .85 or higher (Haertel, 
2013; Wasserman & Bracken, 2003). The current finding that teacher-level growth scores 
are so unstable as to raise questions about their use in teacher evaluation systems is similar 
to conclusions that other researchers have drawn about value-added measures of teacher 
effectiveness (for example, American Educational Research Association, 2015; American 
Statistical Association, 2014; Haertel, 2013; Konstantopoulos, 2014). And the finding is 
consistent with research about classroom teaching that has documented how teachers’ 
effects on student learning vary over many dimensions of the classroom, including subject 
matter, students, and occasions (Berliner, 2014; Darling-Hammond, Amrein-Beardsley, 
Haertel, & Rothstein, 2011; Good, 2014). 

The conclusion that growth scores alone may not be sufficiently stable to support high- 
stakes decisions suggests the need to examine measures of teacher effectiveness and their 
interpretation in evaluation systems. The growth score may not be a sound measure of a 
teacher’s effectiveness, or the magnitude of a teacher’s effect on student learning may not 
be as predictable a trait of the teacher as many evaluation systems assume it is. Rather, a 
teacher’s effectiveness may depend in part on features of the teacher’s students — that is, 
the collection of students in any given year, which change from one year to the next (for 
example, Guarino, Reckase, Stacy, & Wooldridge, 2014). Growth measures may need to 
be thought of differently — considered a measure that is associated with a particular com- 
bination of teacher and students rather than one that is attributable to the teacher alone 
(E. Haertel, personal communication, 2012). Thus, as states examine properties of their 
estimates of teacher effectiveness and decisionmakers weigh how to incorporate teacher- 
level growth scores in teacher accountability policy, they may want to exercise caution 
and further investigate whether teacher-level growth scores are sufficiently stable for use in 
high-stakes decisions. 

Many educator evaluation models include multiple measures such as teacher observations, 
surveys, or additional student outcomes. So policymakers may want to consider the sta- 
bility of those other measures and examine the reliability of different combinations of 
measures and the weight assigned to different measures. The methods used in this study 
to extrapolate the stability of different numbers of years of data or misclassification rates 
may be of interest to policymakers as they consider how to refine their educator evaluation 
models. Local reliability statistics can be used in ways analogous to those illustrated here 
to test the reliability of different scenarios under consideration. 


Limitations of the study


This study has three main limitations. 

First, the study examines scores from teachers in one Nevada school district rather than from 
the statewide population of Nevada teachers. This is because no statewide dataset is available 
that links teachers to students so that teacher scores can be derived from student-level data. 
Examining a single district may affect the study in two ways: 

• If Washoe County School District teachers are more homogeneous in teacher 
effectiveness than the population of Nevada teachers, the study would underes- 
timate the stability of the scores. Similarly, if Washoe County School District 
teachers are less homogeneous in teacher effectiveness than the population of 
Nevada teachers, the study would overestimate stability. 

• If Washoe County School District teachers’ scores are more volatile, on average, 
from year to year than those of other teachers in the state, the study would under- 
estimate stability. Similarly, if Washoe County School District teachers’ scores are 
less volatile, on average, from year to year, the study would overestimate stability. 

Second, to examine stability of scores, the study used teachers who have multiple years 
of scores. The sample of teachers who remained in the district teaching a particular topic 
(reading or math) in grades 4-8 for the period of study may differ from a sample with 
teachers who changed districts or moved to untested grades or subjects. For example, 
teachers who stay in the same setting may have the advantage of adapting to that setting 
and may have more stable scores. 

Third, the study examines scores from Nevada’s Criterion-Referenced Tests, which were 
the state’s assessment for accountability purposes until 2015/16. Like other states, Nevada is 
moving to new assessments based on the Common Core Standards. How that change will 
affect teacher-level growth scores and the stability of those scores is unknown. 
Appendix A. Related literature


Two bodies of research are relevant to this study. One is prior studies of the stability of the 
student growth percentile model, the model that is the focus of this study. Those studies 
focus on the stability of school-level scores; they have not examined teacher-level scores. 
The second is studies of the stability of teacher effectiveness as measured through value- 
added models. While value-added models take a different analytic approach from the 
student growth percentile model, they also derive a teacher effectiveness score, and are 
used in some states and districts as part of their educator effectiveness models. 

Prior to this study, no research had been published on the stability of teacher-level growth 
scores derived from the student growth percentile model. But related studies have examined 
the stability of both school-level student growth percentile scores and teacher-effectiveness 
measures derived from value-added models. Goldschmidt, Choi, and Beaudoin (2012) 
estimated the year-to-year stability in school-level growth scores derived from the student 
growth percentile model by correlating school scores from two consecutive years. They 
found correlations of .46 for math in both elementary and middle school samples. The 
correlations for reading were lower: .32 for elementary schools and .22 for middle schools. 
Lash, Peterson, Vineyard, Barrat, and Tran’s (2013) recent generalizability study found 
similar results for school-level student growth percentile scores. They analyzed four years of 
growth scores for the population of elementary and middle schools in Nevada and found 
results (comparable to the correlations in the Goldschmidt et al. [2012] study) of .43 for 
math and .38 for reading. They examined the implications of these results for the accuracy 
of decisions in a school accountability system designed to identify low-performing schools 
and found that if schools with annual school-level growth scores below 40 were classi- 
fied as low-performing schools, 14 percent of the classifications in math and 11 percent in 
reading would likely have been in error. In other words, these percentages of schools were 
likely to have been misclassified solely because of measurement instability. 

Other research has examined the year-to-year stability of estimates of teacher effects 
derived from value-added models, a popular alternative to the student growth percentile 
model. In analyzing data from five of Florida’s largest school districts, McCaffrey, Sass, 
Lockwood, and Mihaly (2009) found considerable year-to-year variance in teachers’ value- 
added estimates, even after accounting for some factors that could change annually, such 
as experience and recent in-service professional development. Roughly a third of the top 
20 percent of Florida teachers remained in the top 20 percent the next year, while approx- 
imately a tenth of those who had been in the top 20 percent fell to the bottom 20 percent 
of the teacher effectiveness distribution the next year. 

Other recent studies have shown similar year-to-year shifts in teachers’ value-added rank- 
ings. Newton, Darling-Hammond, Haertel, and Thomas (2010) found that 19-41 percent 
(depending on the value-added model used) of teachers saw their effectiveness rankings 
shift by three or more deciles (that is, by 30 percent or more of the population) in either 
direction from one year to the next. Aaronson, Barrow, and Sander (2007) compared two 
years of rankings of Chicago public school teachers, based on the teachers’ value-added 
estimates. They found that 57 percent of teachers who were ranked in the top quartile in 
the first year also ranked in the top quartile in the second year, while 20 percent dropped 
into the lower half of the quality distribution. Finally, in studying New York City’s data 
reports for teachers with multiple value-added estimates, Corcoran (2010) found that 
40 percent of the top 20 percent of teachers in 2007 remained in the top 20 percent in 
2008, while 12 percent fell to the bottom 40 percent. 

Such uncertainty can be reduced by averaging annual estimates across years. Schochet 
and Chiang (2010) found that when trying to distinguish a school district’s low- or high- 
performing upper elementary teachers from those with average performance, the rate of 
misclassification (that is, the rate at which teachers were identified as being better or worse 
than they actually were)⁸ was 36 percent when using only one year of data for each teacher, 
26 percent when averaging across three years of data, and 12 percent when using ten years 
of data. 


Appendix B. Creating student achievement variables 


To create the four student achievement variables used for this study, the study team followed 
Washoe County School District rules regarding the inclusion of student achievement 
scores in teachers’ scores, as well as regarding assignment of students to teachers. It 
was critical to follow these rules because the study examines the stability of teacher scores 
as they will be derived by the district for use in Nevada’s teacher evaluation system. The 
rules are: 

• The scores of students who were not enrolled in their school for the full school 
year are excluded. 

• The scores of students who may have transferred between teachers within a school 
are included; a student is considered to be assigned to the teacher whose class the 
student is in at the time of testing.⁹ 

• A student’s scores are included in the computation of a teacher’s score if the stu- 
dent’s achievement data are coded to the teacher in the dataset, even if the data 
are coded to more than one teacher. 

• The teacher-level scores for teachers who teach more than one class (such as 
middle school teachers who teach multiple math courses each day) or more than 
one grade (such as elementary teachers who teach split-grade classes) are derived 
by pooling all students assigned to the teacher. 

• Teacher-level scores (both status scores and growth scores) may be derived only for 
teachers with at least 10 students who have scores and who have been enrolled in 
the school for the full school year.¹⁰ 

Two additional decision rules follow the Washoe County School District’s procedures for 
teachers who change teaching assignments from one year to the next: 

• Teachers who teach different grades within the same school level (elementary or 
middle school) in different years are included (as teacher scores are assumed to be 
independent of grade level in the Washoe County School District).¹¹ 

• Teachers who change schools between study years are included, because the 
student growth percentile model includes no school-level factors.¹² 
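The inclusion rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the district's actual procedure: the record keys (`teacher_id`, `sgp`, `full_year`) are hypothetical, the data are made up, and the sketch assumes that each record is already coded to the teacher whose class the student was in at the time of testing.

```python
from statistics import median

MIN_STUDENTS = 10  # minimum number of qualifying students per teacher


def teacher_growth_scores(records):
    """Return {teacher_id: median student growth percentile} for eligible teachers.

    `records` is an iterable of dicts with hypothetical keys:
    teacher_id, sgp (a growth percentile or None), full_year (bool).
    """
    pooled = {}
    for rec in records:
        # Exclude students not enrolled for the full year or lacking a score.
        if not rec["full_year"] or rec["sgp"] is None:
            continue
        # A student counts for every teacher the record is coded to, and all
        # of a teacher's classes and grades are pooled together.
        pooled.setdefault(rec["teacher_id"], []).append(rec["sgp"])
    # Only teachers with at least MIN_STUDENTS qualifying students get a score.
    return {t: median(s) for t, s in pooled.items() if len(s) >= MIN_STUDENTS}


# Made-up records: teacher "A" has 10 qualifying students, teacher "B" only 9.
records = [{"teacher_id": "A", "sgp": 10 * i, "full_year": True} for i in range(1, 11)]
records += [{"teacher_id": "B", "sgp": 50, "full_year": True} for _ in range(9)]
records.append({"teacher_id": "B", "sgp": 50, "full_year": False})  # partial year
example_scores = teacher_growth_scores(records)
```

In this made-up example only teacher "A" receives a score, because teacher "B" has just nine students who were enrolled for the full year.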

The number of teachers in the Washoe County School District files for reading and math 
and the number for whom scores could be computed based on the criteria noted above 
are presented in table B1. The total number of teachers is the number of teachers in 
grades 4-8 who had at least one student assigned to them for the subject tested (math or 
reading). In each year, at least 86 percent of the teachers were eligible, based on Washoe 
County School District criteria, to have a teacher-level growth score. In math, 390 teachers 
met the inclusion criteria for the three years. Of these, 369, or 95 percent, were eligible 
to have scores computed, and they make up the sample of math teachers in this study. In 
reading, 404 teachers met the inclusion criteria across all three years of the study. Of these, 
375, or 93 percent, were eligible to have scores computed for their class, and they make up 
the sample of reading teachers in this study. 


Table B1. Number of Washoe County School District teachers of reading and math, 
and number meeting eligibility criteria for the study, by year and across all years 

                               Math                              Reading
                  Total      Number                 Total      Number
                  number of  eligible   Percentage  number of  eligible   Percentage
Year              teachers   for study  eligible    teachers   for study  eligible

2009/10           664        588        88.6        685        593        86.6
2010/11           677        615        90.8        696        636        91.4
2011/12           662        611        92.3        674        628        93.2
Eligible across
the three years   390        369        94.6        404        375        92.8

Source: Authors' analysis of data provided by Washoe County School District. 


Appendix C. Design and methods 


This appendix provides a brief background about the psychometric theory of generalizability 
along with details about how it is applied in this study and how it might be applied in 
similar studies with different assumptions. 

Interpretations of a measurement (for example, a test score) taken on a particular day using 
a particular measurement method are rarely limited to an interpretation of the measure- 
ment to that specific day and measurement method. Instead, inferences are made from that 
single measurement to answer broader questions. For example, a test score can be used to 
answer whether a student has developed the math knowledge expected of students at his 
or her grade or whether the student’s teacher is a capable teacher of math. These are not 
questions about performance on a particular day or about performance assessed by a par- 
ticular method of measurement; they are questions about an individual’s enduring traits. 

When test scores are interpreted, generalizations move from the particular score observed 
to a broader universe of possible scores that could have been observed — for example, those 
that could be achieved on different days, using different measurement methods. Gener- 
alizability theory provides a conceptual framework that is useful in accounting for key 
features, called facets, of the universe of admissible observations. Webb and Shavelson 
(2005) describe the universe of admissible observations as the collection of observations 
that would be acceptable to decisionmakers as substitutes for the particular score that was 
observed. Generalizability theory also provides a statistical method to evaluate the preci- 
sion of a generalization and to examine how changing the way in which measurements are 
sampled will alter the precision of inferences. 

For this generalizability study of teacher effectiveness measures, the study team used a 
simple design that includes only one facet, or source of error: the year in which a teacher 
is measured. The design is shown in table C1. Rows represent teachers, columns represent 
years, and each cell has one observation, $X_{ty}$, which is the effectiveness measure 
observed for teacher t in year y. This is referred to as a crossed design because each teacher 
is observed each year. In the analysis of the teacher-level growth score, $X_{ty}$ is the median 
student growth percentile for students of teacher t in year y. In the analysis of the teacher 
status measure, $X_{ty}$ is the proportion of teacher t’s students who were proficient in the 
subject tested in year y. 

Applying a linear model, $X_{ty}$ may be represented as the sum of the expected values of 
effects associated with rows (teachers), columns (years), their interaction (teachers by 
years), and random errors. For the single-facet crossed design, the model is expressed for 
the generalizability study as 

$$X_{ty} = \mu + (\mu_t - \mu) + (\mu_y - \mu) + (X_{ty} - \mu_t - \mu_y + \mu) \tag{C1}$$

where $\mu$ is the grand mean, the expectation of $X_{ty}$ taken over all members of the teacher 
population and all years in the universe of observations; ($\mu_t - \mu$) is the effect of teacher t, 
the deviation from the grand mean of the expected value of the teacher’s score taken over 
all years in the universe of observations; ($\mu_y - \mu$) is the effect of year y, the deviation from 
the grand mean of the expected value of the scores for year y taken over all teachers in the 
population; and ($X_{ty} - \mu_t - \mu_y + \mu$) is a residual effect, or the portion of the score $X_{ty}$ that is 
not explained by the other effects. The residual effect includes the effect of the interaction 
between teachers and years as well as all other random sources of unexplained measurement 
errors. 


Table C1. A single-facet crossed design 

                                  Year
Teacher   1          2          3          4          ...   n_y

1         X_11       X_12       X_13       X_14       ...   X_1,n_y
2         X_21       X_22       X_23       X_24       ...   X_2,n_y
3         X_31       X_32       X_33       X_34       ...   X_3,n_y
4         X_41       X_42       X_43       X_44       ...   X_4,n_y
...
n_t       X_n_t,1    X_n_t,2    X_n_t,3    X_n_t,4    ...   X_n_t,n_y

Source: Authors’ construction. 

The effects may vary. For example, the teacher effect, ($\mu_t - \mu$), may vary across teachers in 
the population. Each effect then has an associated variance, which is called a component 
of variance. The variance component for the effect of teachers is $\sigma^2_t$, the variance component 
for the effect of years is $\sigma^2_y$, and the variance component for the residual effect is $\sigma^2_{ty,e}$. 
The variance of $X_{ty}$, taken over all teachers in the population and all years in the universe 
of observations, is the sum of the three variance components. Just as the variance of $X_{ty}$ is 
the variance of scores taken in a single year, variance components are also associated with 
one year. This is important when the variance components are used to estimate the effects 
of changing the number of years of data that enter a teacher’s score. 

Variance components are the parameters estimated in generalizability theory analyses. 
(For details about the methods to estimate variance components, see Brennan [2001] or 
Cronbach, Gleser, Nanda, and Rajaratnam [1972].) With variance components, it is possi- 
ble to identify how variance in observed scores is expected to be affected by different facets 
of the universe and by sampling different numbers of observations from each facet. It also 
becomes possible to examine how the variance in scores would be affected by different 
types of designs for data collection. As a result, it is possible to consider how to maximize 
the information from a score and minimize the impact of other sources of variance and, 
thus, to design data collection plans that provide dependable measurements. 
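As a rough illustration of how the three variance components can be estimated for a single-facet crossed design, the sketch below uses the standard analysis-of-variance estimators based on expected mean squares. It is not the GENOVA procedure used in the study; the function name, the example data, and the choice to truncate negative estimates at zero are all assumptions made for this example.

```python
def variance_components(scores):
    """Estimate (var_teacher, var_year, var_residual) from a teachers-by-years table.

    `scores[t][y]` is the single observed score for teacher t in year y.
    """
    n_t, n_y = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_t * n_y)
    t_means = [sum(row) / n_y for row in scores]
    y_means = [sum(row[y] for row in scores) / n_t for y in range(n_y)]

    # Sums of squares for teachers, years, and the residual.
    ss_t = n_y * sum((m - grand) ** 2 for m in t_means)
    ss_y = n_t * sum((m - grand) ** 2 for m in y_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_t - ss_y

    ms_t = ss_t / (n_t - 1)
    ms_y = ss_y / (n_y - 1)
    ms_res = ss_res / ((n_t - 1) * (n_y - 1))

    # Solve the expected-mean-square equations; negative estimates are set to 0
    # (a common convention, assumed here).
    var_res = ms_res
    var_t = max((ms_t - ms_res) / n_y, 0.0)
    var_y = max((ms_y - ms_res) / n_t, 0.0)
    return var_t, var_y, var_res


# Made-up, purely additive table: each score is a teacher effect plus a year effect.
example = [[40, 42, 44], [50, 52, 54], [60, 62, 64]]
var_t, var_y, var_res = variance_components(example)
```

Because the made-up table has no teacher-by-year interaction or random error, the residual component is zero and all of the variance is attributed to teachers and years.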

A key concept in generalizability theory is that, unlike in classical test theory, there is 
not a single “true” score or “error” score. In classical test theory a person’s observed score 
is simply the sum of a true score and an error score. Generalizability theory recognizes 
multiple sources of influence on scores. Which of those influences enter the true score and 
which enter the error score depend on the use of the score and how it is to be interpreted. 

In the case of the simple single-facet crossed design in this study, there are three effects, 
each having an associated variance component: residual, teacher, and year. The residual 
component is equivalent to the error variance of classical test theory. The teacher com- 
ponent is equivalent to the true score variance of classical test theory. The year compo- 
nent does not have a comparable term in classical test theory because classical test theory 
assumes it to be zero (that is, it assumes strictly parallel forms of tests or measurement 
methods, with each form having the same mean value of scores), while generalizability 
theory relaxes that assumption. For the teacher-level status measure (derived from student 
scores on the statewide achievement test), the variance component for year would be 
greater than zero if, for example, there had been a change in the statewide test that result- 
ed in the test becoming easier, on average, than previous years’ tests. That type of change 
may be a source of error for some decisions but not for others. It would be a source of 
error when the scores of teachers are used to make a decision that involves an absolute 
comparison, such as the comparison of a teacher’s score to a cutscore that teachers must 
meet in order to be classified as effective. A change in the average level of difficulty of the 
test from one year to the next could change a teacher’s position relative to the criterion, 
even in cases where the teacher’s effectiveness had not changed. In contrast, the change in 
test difficulty would not be a source of error for decisions involving relative comparisons, 
such as a decision to select the top 10 percent of the teachers for awards. The rank order 
of teachers would not be affected by a shift in test difficulty that adds a constant to each 
teacher’s score. 
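The distinction between absolute and relative decisions can be illustrated with a small sketch. The teacher scores, the size of the shift, and the cutscore of 40 are made-up values chosen only to show the effect of adding a constant to every score.

```python
CUTSCORE = 40  # illustrative cutscore
scores = {"A": 38, "B": 45, "C": 52, "D": 61}  # made-up teacher scores
shift = 5  # hypothetical constant added to every score by an easier test

shifted = {t: s + shift for t, s in scores.items()}


def ranking(d):
    """Teachers ordered from highest to lowest score (a relative comparison)."""
    return sorted(d, key=d.get, reverse=True)


def below_cut(d):
    """Teachers whose score falls below the cutscore (an absolute comparison)."""
    return sorted(t for t, s in d.items() if s < CUTSCORE)


same_ranking = ranking(scores) == ranking(shifted)
flagged_before = below_cut(scores)
flagged_after = below_cut(shifted)
```

In this example the rank order is identical before and after the shift, but teacher "A" moves from below the cutscore to above it even though nothing about the teacher has changed.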

This study assumes an absolute comparison, and the standard error of measurement 
includes both the year and residual variance components: 


$$\mathrm{SEM}_{abs} = \sqrt{\frac{\sigma^2_y}{n_y} + \frac{\sigma^2_{ty,e}}{n_y}} \tag{C2}$$

where $n_y$ is the number of years of data that are averaged to obtain the teacher’s scores. 
The standard error of measurement is the square root of the total error variance, and it 
provides a measure of the error in units of the scale of the teacher score. Since it is report- 
ed in units of the score scale, the magnitude of the standard error of measurement cannot 
be judged independent of that scale. The standard error of measurement is useful in con- 
structing confidence bands or intervals that, with a particular level of certainty, are likely 
to include the true (error-free) score of a teacher. If using a score from a single year of data, 
the standard error of measurement is simply based on the sum of the two variance compo- 
nents. The standard error of measurement is reduced (and reliability increased) if two or 
more years of data are sampled and averaged to obtain a teacher’s score. 

Some states may have systems using a relative comparison, and the standard error of mea- 
surement in those cases includes only the residual component: 


$$\mathrm{SEM}_{rel} = \sqrt{\frac{\sigma^2_{ty,e}}{n_y}} \tag{C3}$$

The generalizability coefficient ranges from 0 to 1 and is analogous to the reliability coefficient 
of classical test theory in that it is defined as a ratio of the variance among teachers 
to the total variance. For absolute decisions, as used in this study, the generalizability coefficient 
is¹³ 

$$\rho^2_{abs} = \frac{\sigma^2_t}{\sigma^2_t + \frac{\sigma^2_y}{n_y} + \frac{\sigma^2_{ty,e}}{n_y}} \tag{C4}$$

For relative decisions, the generalizability coefficient is 

$$\rho^2_{rel} = \frac{\sigma^2_t}{\sigma^2_t + \frac{\sigma^2_{ty,e}}{n_y}} \tag{C5}$$

In equations C2-C5 the term $n_y$ is the number of years of data that enter the teacher’s 
score. By sampling more years, one would expect to reduce the standard error of measurement 
and increase the reliability coefficients. Thus, once the components of variance 
have been estimated, the standard errors of measurement and generalizability coefficients 
can be estimated for situations that differ in the number of years of data that are averaged 
to obtain a teacher’s score. Equation C4 is the basis for the estimates reported in this 
study for different numbers of years of data. 
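The following sketch shows how the standard errors of measurement and the coefficients for absolute and relative decisions could be projected for different numbers of years once variance components are in hand. The function name and the variance component values (50, 5, and 45) are made up for illustration and are not the study's estimates.

```python
from math import sqrt


def stability_indices(var_t, var_y, var_res, n_years):
    """Standard errors of measurement and coefficients for a score averaged over n_years."""
    abs_error = (var_y + var_res) / n_years  # error variance, absolute decisions
    rel_error = var_res / n_years            # error variance, relative decisions
    return {
        "sem_abs": sqrt(abs_error),
        "sem_rel": sqrt(rel_error),
        "coef_abs": var_t / (var_t + abs_error),
        "coef_rel": var_t / (var_t + rel_error),
    }


# Made-up variance components for teachers, years, and the residual.
by_years = {n: stability_indices(50.0, 5.0, 45.0, n) for n in (1, 2, 3)}
```

With these illustrative values the coefficient for absolute decisions rises from .50 with one year of data to .75 with three years, while the corresponding standard error of measurement shrinks by a factor of the square root of 3.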

The GENOVA software package developed by Brennan (2001) was used to estimate the 
three components of variance (teachers, years, and the variance unexplained by these two 
sources) and to compute two indicators of stability: the generalizability coefficient and the 
standard error of measurement. 
Appendix D. Calculating misclassifications of effectiveness 


The misclassification example in the findings section identifies the proportion of misclassifications 
for a “typical” math teacher whose true score was 50 when the cutscore for being 
considered effective was 40. If the true score for each teacher were known, the misclassification 
rate for each teacher could be calculated using the same method, and the rates could 
be averaged across teachers to obtain the expected misclassification rate for the group as a 
whole. For teachers with a true score above 40, the likelihood of misclassification would be 
the proportion of scores that fall below 40, as in the previous example. For teachers with a 
true score below 40, misclassification would occur when errors caused scores to fall above 
40. 
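The calculation described above can be sketched as follows, assuming (as in note 15) that errors are normally distributed with a standard deviation equal to the standard error of measurement. The function names and the standard error of 10 used in the example are assumptions made for illustration and are not values reported by the study.

```python
from statistics import NormalDist


def misclassification_prob(true_score, cutscore, sem):
    """Probability that an observed score lands on the wrong side of the cutscore."""
    p_below = NormalDist(mu=true_score, sigma=sem).cdf(cutscore)
    # Teachers at or above the cutscore are misclassified when the observed score
    # falls below it; teachers below the cutscore, when it falls at or above it.
    return p_below if true_score >= cutscore else 1.0 - p_below


def expected_rate(true_scores, cutscore, sem):
    """Average the per-teacher misclassification probabilities across a group."""
    probs = [misclassification_prob(t, cutscore, sem) for t in true_scores]
    return sum(probs) / len(probs)


# A teacher with a true score of 50, a cutscore of 40, and a hypothetical SEM of 10.
p_typical = misclassification_prob(50, 40, 10)
group_rate = expected_rate([30, 50], 40, 10)
```

Under these assumed values the cutscore sits one standard error below the true score, so the teacher would be misclassified in roughly 16 percent of cases.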

While a teacher’s true score cannot be measured, it can be estimated. Using the three 
years of scores provided for each teacher, along with the study findings, the study team 
estimated teachers’ true scores for every math teacher in the sample and then computed 
the proportion of classifications of teachers that would be in error when teachers were 
classified as ineffective or effective against a cutscore of 40. The study team looked at two 
types of classification errors: identifying teachers with a true score above the cutscore as 
ineffective and failing to identify ineffective teachers with a true score below the cutscore. 
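The true score estimation step relies on Kelley’s formula (see note 14), which pulls each observed score toward the group mean in proportion to the unreliability of the score. A minimal sketch follows; the function name, the observed scores, and the coefficient of .75 are made up for illustration.

```python
from statistics import mean


def kelley_true_score(observed, group_mean, reliability):
    """Kelley's formula: shrink an observed score toward the group mean."""
    return group_mean + reliability * (observed - group_mean)


# Made-up three-year average growth scores for four teachers.
observed_scores = [30.0, 45.0, 55.0, 70.0]
group_mean = mean(observed_scores)  # 50.0
estimated = [kelley_true_score(x, group_mean, 0.75) for x in observed_scores]
```

With a coefficient of .75, an observed score of 30 in a group with a mean of 50 yields an estimated true score of 35; the lower the coefficient, the more strongly estimates are pulled toward the mean.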

When classifications are based on one year’s annual teacher-level growth score, the proportion 
of teachers expected to be misclassified is .14 (table D1).¹⁵ The proportion of effective 
teachers whose true growth score is 40 or higher and who are likely to be incorrectly classified 
as ineffective is .13. The proportion of ineffective teachers whose true growth score is 
below 40 and who are likely to be incorrectly classified as effective is .42. While the latter 
proportion is high, it represents fewer teachers than the proportion of misclassified effec- 
tive teachers. Some 17 teachers had a true score below 40, so an error rate of .42 means 
that 7 teachers would likely be misclassified. By comparison, 352 teachers had a true score 
of 40 or higher, so a misclassification rate of .13 means that 46 teachers would likely be 
misclassified. 
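The arithmetic linking the two group-specific rates to the overall rate can be checked directly. The counts and rates are those reported in the text; the variable names are chosen for this example.

```python
n_effective, n_ineffective = 352, 17
rate_effective, rate_ineffective = 0.13, 0.42

mis_effective = n_effective * rate_effective        # about 46 teachers
mis_ineffective = n_ineffective * rate_ineffective  # about 7 teachers

# The overall rate is the teacher-weighted average of the two group rates.
overall = (mis_effective + mis_ineffective) / (n_effective + n_ineffective)
```

Rounded to two decimal places, the overall rate is .14, matching the figure reported for all teachers with one year of data.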

The expected misclassification rates decline when teacher-level growth scores are derived 
by averaging two or more annual scores (see table D1). 

The methods used in this example can also be used to examine how changes in the cutscore 
will alter expected misclassification rates. 


Table D1. Expected misclassification rates when identifying ineffective teachers as 
those with a growth score estimate below 40 

Number of years of growth scores   All teachers   Effective teachers   Ineffective teachers
included in the estimate           (n = 369)      (n = 352)            (n = 17)

1                                  .14            .13                  .42
2                                  .10            .09                  .39
3                                  .08            .06                  .37

Source: Authors’ analysis of 2009/10, 2010/11, and 2011/12 data provided by Washoe County School 
District. 


Notes 


1. Collins and Amrein-Beardsley (2014) surveyed all 50 states and the District of Columbia 
to learn whether they were using or piloting a student growth percentile model or 
a value-added model and, if so, which one. The District of Columbia and 22 of the 40 
states that reported using or developing a model identified their model. Of these, 13 
were using or piloting the student growth percentile model. The other 18 of the 40 
states indicated that they were using or developing a growth model or a value-added 
model but did not identify the model. New Jersey Department of Education (2012) 
indicated that as many as 20 states may be considering the student growth percentile 
model. 

2. The stability of a measure over time (the focus of this study) provides information 
about the measure’s reliability (American Educational Research Association, Amer- 
ican Psychological Association, & National Council on Measurement in Education, 
1999; Haertel, 2006). Within an argument-based approach to validity (Kane, 2006, 
2013), this information is also recognized as providing evidence to evaluate the validi- 
ty of the inferences made in the measure’s interpretations. 

3. In creating the comparison group for a student, the analysis takes into account as 
many prior years of achievement test scores as the student has. Grade 4 students, for 
example, would have, at most, one prior year of scores because in Nevada, achieve- 
ment testing begins in grade 3. Grade 8 students would have, at most, five prior years 
of scores. The grouping of students by their achievement histories is not exact — that 
is, rather than looking for students with exact matches in prior scores, the statistical 
method known as regression analysis is used to form approximate groups with similar, 
but not exact, achievement histories. More information about the method can be 
found in Castellano and Ho (2013), as well as in the original citation, Betebenner 
(2011). 

4. As of 2014, at least 50 percent of a Nevada educator’s evaluation was to be based on 
student achievement data from the state’s accountability system, with 45 percent of 
the total evaluation based on scores derived from the student growth percentile model. 
New legislation passed in spring 2015 reduced to 40 percent the weight of student data 
in a teacher’s evaluation and further specified that half of that 40 percent would come 
from district rather than state tests, without specifying the proportion to be based on 
scores from the student growth percentile model. Because of changes to the statewide 
testing program, no student achievement data will be included in teacher evaluations 
for 2015/16, and for 2016/17, 20 percent of the evaluations will be based on student 
achievement data, with half coming from state and half from district tests. The new 
40 percent requirement will be in full effect starting in 2017/18. 

5. The coefficient and standard error of measurement can be estimated for any number 
of years of data by applying the statistics derived from the data and using standard 
assumptions from psychometric theory. Estimates are presented for one, two, and three 
years of data so that educators and policymakers can see how the values change as 
years of data are added. More than three years seems longer than policymakers or 
administrators would want to wait to make a decision about a teacher’s effectiveness. 
Teachers in Nevada achieve tenure in three years; thus, a major decision that might be 
based on the scores examined in this report can use, at most, three years of data. 

6. Cutscores are selected points on the score scale of a test that are used to determine 
whether a particular test score is sufficient for some purpose; for example, student 
performance on a test may be classified into one of several categories such as basic, 
proficient, or advanced based on cutscores (see, for example, Zieky & Perie, 2006). 
At the time of this study, Nevada had not yet set cutscores for its teacher evaluation 
system. The study team selected a score of 40 as an example because it had been dis- 
cussed by the Washoe County School District as a possible cutscore. This example 
shows how classification errors might be examined for a particular cutscore and, in 
doing so, demonstrates how the standard error of measurement could be used to evaluate 
different error rates for different cutscores during the design of an evaluation 
system. 

7. At the time of the study by Lash et al. (2013), Nevada had not determined a cutscore, or 
criterion score, to identify low-performing schools for the state’s school accountability 
or principal evaluation system. Policymakers were discussing setting the cutscore at 40. 

8. Schochet and Chiang (2010) explain that classification error, in their context, relates 
to the false positive (Type I) and negative (Type II) error rates from classical hypothesis 
testing. The Type I error rate is essentially the probability that the test of teacher 
effectiveness will erroneously find that a truly average teacher performed significantly 
worse than average — that is, the probability that an average teacher will be erroneous- 
ly identified for remediation. Conversely, the false negative error rate is the probability 
that the test will fail to identify teachers whose true performance is a certain number 
of standard deviations below average — that is, the probability that a low-performing 
teacher will not be identified for remediation even though he or she warrants it. 

9. The district follows this policy because its experience is that the data about transfers 
are unreliable and that there are few within-school transfers after the first few weeks 
of school. 

10. Computation of the student growth score, a student growth percentile, requires that 
a student have a test score in the current year and at least one test score in a previous 
year. Students who have a student growth percentile, then, will have a proficiency score 
for the current year. Students with proficiency scores may be missing a student growth 
percentile if they were never tested previously in Nevada. However, even students new 
to the Washoe County School District will have a student growth percentile if they 
attended school in Nevada in the past, because the Nevada Department of Education 
computes student growth percentile scores in an analysis that pools students from all 
districts in the state. 

11. The study excluded a few teachers who changed school levels because the district was 
interested in analyzing the data separately by school level in the future and wanted 
the samples for those future analyses to contain the teachers in the current sample. 
For this reason, three teachers were excluded from the analysis of math scores and four 
from the analysis of reading scores. 

12. Omitting school effects from a model estimating teacher effects attributes any school 
effect (contextual, direct, or indirect) to teachers. Such effects tend to be left out of 
value-added models for several reasons. For example, it is hard to determine how a 
teacher’s principal or colleagues may have influenced a teacher’s score (Corcoran, 
2010) or how schools are selected by students and teachers (Hanushek & Rivkin, 
2010), and research suggests that the between-school variance in teacher value-added 
models tends to be fairly small compared to the variance within schools (Hanushek & 
Rivkin, 2010). 

13. For absolute decisions, Brennan (2001) refers to the coefficient as an index of depend- 
ability rather than as a generalizability coefficient. For the sake of simplicity, this report 
refers to the coefficients derived for both absolute and relative decisions as a reliability 
coefficient. 
14. A regression equation known as Kelley’s formula provides a means to estimate the true 
score for each teacher (Hubert & Wainer, 2013): $\hat{T}_t = \bar{X} + r_{xx'}(X_t - \bar{X})$, where $\hat{T}_t$ is the 
estimated true score for teacher t, $\bar{X}$ is the mean of teacher scores, $r_{xx'}$ is the coefficient 
summarizing the stability of the score, and $X_t$ is the observed score for teacher t. For 
this example the average of the three estimates of a teacher’s growth (rather than one 
of them) was used as the observed score, and thus the coefficient is for the average of 
three annual scores. 

15. As noted, the misclassification results assume a normal distribution of errors with a 
standard deviation equal to the standard error of measurement. They represent the 
proportion of observed scores that would fall below the relevant cutscore for teachers 
whose true scores actually lie above the cutscore, and vice versa; that is, the proportion 
that would be expected to be judged incorrectly due to measurement error. 